%
% HLMBook - Steve Young    08/01/97
%
% Updated - Gareth Moore   15/01/02
%

\newpage
\mysect{LGPrep}{LGPrep}

\mysubsect{Function}{LGPrep-Function}

\index{lgprep@\htool{LGPrep}|(}

\htool{LGPrep} scans a language model training text and generates a
set of gram files holding the $n$-grams seen in the text
along with their counts.  By default, the output gram files are named
\texttt{gram.0}, \texttt{gram.1}, \texttt{gram.2}, etc. However, the 
root name can be changed using the \texttt{-r} option and the start
index can be set using the \texttt{-i} option.  

Each output gram file is sorted but the files themselves will not be
sequenced (see section~\ref{s:gramfs}).  Thus, when using
\htool{LGPrep} with substantial training texts, it is good practice to
subsequently copy the complete set of output gram files using
\htool{LGCopy} to reorder them into sequence. This process will also
remove duplicate occurrences, making the resultant files more compact
and faster for the \HLM\ processing tools to read.

Since \htool{LGPrep} will often encounter new words in its input, it
is necessary to update the word map.  The normal operation therefore
is that \htool{LGPrep} begins by reading in a word map containing all
the word ids required to decode all previously generated gram files.
This word map is then updated to include all the new words seen in the
current input text.  On completion, the updated word map is output to
a file of the same name as the input word map in the directory used to
store the new gram files.  Alternatively, it can be output to a
specified file using the \texttt{-w} option.  The sequence number in
the header of the newly created word map will be one greater than that
of the original.
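
The update step can be pictured with a short Python sketch.  This is
only an illustration under stated assumptions, not the \htool{LGPrep}
implementation: the function name \texttt{update\_word\_map}, the
in-memory dictionary form of the map and the policy of allocating new
ids after the largest existing id are all hypothetical.
\begin{verbatim}
def update_word_map(word_map, seq_no, words):
    """Return (new_map, new_seq_no) after adding unseen words.

    word_map maps word -> integer id (hypothetical in-memory form).
    Allocating ids after the largest existing id is an assumption
    made for illustration only.
    """
    new_map = dict(word_map)              # existing ids are preserved
    next_id = max(new_map.values(), default=0) + 1
    for w in words:
        if w not in new_map:              # only new words get an id
            new_map[w] = next_id
            next_id += 1
    return new_map, seq_no + 1            # sequence number goes up by one

old = {"the": 1, "cat": 2}
new, seq = update_word_map(old, 0, "the dog saw the cat".split())
print(new, seq)
\end{verbatim}
Words already in the map keep their ids in this sketch, which is what
allows previously generated gram files to be decoded with the updated map.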

\htool{LGPrep} can also apply a set of ``match and replace'' edit 
rules to the input text stream.  The purpose of this facility is not
to replace input text conditioning filters but to make simple changes
to the text after the main gram files have been generated.  The
editing works by passing the text through a window one word at a time.
Each edit rule consists of a pattern and a replacement text. At each
step, the pattern of each rule is matched against the window and if a
match occurs, then the matched word sequence is replaced by the string
in the replacement part of the rule.  Two sets of gram files are
generated by this process.  A ``negative'' set of gram files contains
$n$-grams corresponding to just the original text strings which were
matched and a ``positive'' set of gram files contains $n$-grams
corresponding to the text after modification.  All text for which no
rules matched is ignored and
generates no gram file output.  Once the positive and negative gram
files have been generated, the positive grams are added (i.e.\ input
with a weight of $+1$) to the original set of gram files and the
negative grams are subtracted (i.e.\ input with a weight of $-1$).  The
net result is that the tool reading the full set of gram files
receives a stream of $n$-grams which will be identical to the stream that
it would have received if the editing commands had been applied to the
text source when the original main gram file set had been generated.
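
The cancellation can be checked on a toy case.  The following Python
sketch is illustrative only: the helper \texttt{ngram\_counts}, the
choice of bigrams and the example sentence are assumptions, and the
negative and positive sets are simply taken to be the $n$-grams of the
matched window before and after replacement.
\begin{verbatim}
from collections import Counter

def ngram_counts(words, n, weight=1):
    """Count the n-grams in a word list, each weighted by `weight`."""
    counts = Counter()
    for i in range(len(words) - n + 1):
        counts[tuple(words[i:i + n])] += weight
    return counts

n = 2
source   = "say one hundred fifty now".split()      # original text
matched  = "one hundred fifty".split()              # window that matched
replaced = "one hundred and fifty".split()          # window after editing
edited   = "say one hundred and fifty now".split()  # text edited at source

combined = Counter(ngram_counts(source, n))         # main gram set
combined.update(ngram_counts(replaced, n, +1))      # positive grams
combined.update(ngram_counts(matched, n, -1))       # negative grams

# Drop n-grams whose counts have cancelled to zero.
net = {g: c for g, c in combined.items() if c != 0}
print(net == dict(ngram_counts(edited, n)))
\end{verbatim}
Here the bigram (\texttt{hundred}, \texttt{fifty}) cancels to zero and
the two bigrams involving \texttt{and} are added, so the net counts equal
those obtained by editing the text before counting.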

The edit rules are stored in a file and read in using the \texttt{-f}
option.  They consist of set definitions and rule definitions, each
written on a separate line. Each set defines a set of words and is
identified by an integer in the range 0 to 255:
\begin{verbatim}
    <set-def>     = '#'<number> <word1> <word2> ... <wordN>.
\end{verbatim}
For example, 
\begin{verbatim}
    #4 red green blue 
\end{verbatim}
defines set number 4 as being the 3 words ``red'', ``green'' and ``blue''.  Rules
consist of an \textit{application factor}, a \textit{pattern} and a
\textit{replacement}:
\begin{verbatim}
    <rule-def>    = <app-factor> <pattern> : <replacement>
    <pattern>     = { <word> | '*' | !<set> | %<set> }
    <replacement> = { '$'<field> | string }
\end{verbatim}
The application factor should be a real number in the range 0 to 1 and
it specifies the proportion of occurrences of the pattern which should
be replaced.  The pattern consists of a sequence of words, wildcard
symbols (``\texttt{*}'') which match any word, and set references of the
form \texttt{\%n} denoting any word which is in set number \texttt{n}
and \texttt{!n} denoting any word which is not in set number
\texttt{n}.  The replacement consists of a sequence of words and field
references of the form \texttt{\$i}, each of which denotes the
\texttt{i}'th matching word in the input (the example below numbers
these fields from 0).

As an example, the following rules would translate 50\% of the
occurrences of numbers in the form ``one hundred fifty'' to ``one
hundred and fifty'' and 30\% of the occurrences of ``one hundred'' to
``a hundred''.
\begin{verbatim}
    #0 one two three four five six seven eight nine fifty sixty seventy
    #1 hundred
    0.5 * * hundred %0 * * : $0 $1 $2 and $3 $4 $5
    0.3 * * !0 one %1  * * : $0 $1 $2 a $4 $5 $6
\end{verbatim}
Note finally that \htool{LGPrep} processes edited text in a parallel
stream to normal text, so it is possible to generate edited gram files
whilst generating the main gram file set.  Usually, however, the main
gram files already exist, so generation of the main gram files is
suppressed using the \texttt{-z} option when using edit rules.
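
The matching step can be sketched in Python using the first of the
example rules above.  This is a simplified illustration rather than the
tool's own code: the function names \texttt{token\_matches} and
\texttt{apply\_rule} are hypothetical, the application factor is ignored
(every match is replaced), the window is assumed to be exactly as long as
the pattern, and field references are taken to be numbered from 0 as in
the example rules.
\begin{verbatim}
def token_matches(tok, word, sets):
    """Test one pattern token against one word of the window."""
    if tok == "*":                       # wildcard: any word
        return True
    if tok.startswith("%"):              # word must be in set n
        return word in sets[int(tok[1:])]
    if tok.startswith("!"):              # word must not be in set n
        return word not in sets[int(tok[1:])]
    return tok == word                   # literal word

def apply_rule(pattern, replacement, window, sets):
    """Return the replacement words if the rule matches, else None."""
    if len(window) != len(pattern):
        return None
    if not all(token_matches(t, w, sets)
               for t, w in zip(pattern, window)):
        return None
    # $i copies the i'th matched word; other tokens are literal.
    return [window[int(r[1:])] if r.startswith("$") else r
            for r in replacement]

sets = {0: set("one two three four five six seven eight nine "
               "fifty sixty seventy".split())}
pattern = "* * hundred %0 * *".split()
replacement = "$0 $1 $2 and $3 $4 $5".split()
window = "about one hundred fifty people came".split()

result = apply_rule(pattern, replacement, window, sets)
print(" ".join(result))
\end{verbatim}
A window such as ``about one hundred big people came'' would not match,
since ``big'' is not in set 0, and \texttt{apply\_rule} would return
\texttt{None}.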

\mysubsect{Use}{LGPrep-Use}

\htool{LGPrep} is invoked by typing the command line
\begin{verbatim}
   LGPrep [options] wordmap [textfile ...]
\end{verbatim}
Each text file is processed in turn and treated as a continuous stream
of words.  If no text files are specified, standard input is used.
This is the more usual case since it allows the input text source to
be filtered before input to \htool{LGPrep}, for example, using
\htool{LCond.pl} (in \texttt{LMTutorial/extras/}).

Each $n$-gram in the input stream is stored in a buffer.  When the buffer
is full it is sorted and multiple occurrences of the same $n$-gram are
merged and the count set accordingly.  When this process ceases to
yield sufficient buffer space, the contents are written to an output
gram file.
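
The buffering strategy can be sketched as follows.  This Python fragment
is a toy model under stated assumptions: the names \texttt{compact} and
\texttt{add\_ngram}, the \texttt{min\_free} threshold used to decide
that compaction no longer frees enough space, and the use of a list to
stand in for output gram files are all hypothetical.
\begin{verbatim}
def compact(buffer):
    """Sort (ngram, count) entries and merge repeated n-grams."""
    merged = []
    for gram, count in sorted(buffer):
        if merged and merged[-1][0] == gram:
            merged[-1] = (gram, merged[-1][1] + count)
        else:
            merged.append((gram, count))
    return merged

def add_ngram(buffer, gram, capacity, min_free, flush):
    """Store one n-gram; compact when full, flush if still too full."""
    buffer.append((gram, 1))
    if len(buffer) >= capacity:
        buffer[:] = compact(buffer)
        if capacity - len(buffer) < min_free:
            flush(list(buffer))          # write one sorted "gram file"
            buffer.clear()

files = []                               # stands in for gram.0, gram.1, ...
buf = []
stream = [("a", "b"), ("a", "b"), ("b", "c"),
          ("a", "b"), ("c", "d"), ("d", "e")]
for g in stream:
    add_ngram(buf, g, capacity=3, min_free=1, flush=files.append)
print(files, buf)
\end{verbatim}
In this run the first two compactions free a slot by merging repeats of
(\texttt{a}, \texttt{b}); the third frees nothing, so the sorted
contents are written out and the buffer is emptied.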

The word map file defines the mapping of source words to the numeric
ids used within \HLM\ tools.  Any words not in the map are allocated
new ids and added to the map.  On completion, a new map with the same
name (unless specified otherwise with the \texttt{-w} option) is
output to the same directory as the output gram files.  To initialise
the first invocation of this updating process, a word map file should
be created with a text editor containing the following:
\begin{verbatim}
    Name=xxxx
    SeqNo=0
    Language=yyyy
    Entries=0
    Fields=ID
    \Words\
\end{verbatim}
where \texttt{xxxx} is an arbitrarily chosen name for the word map and
\texttt{yyyy} is the language. A header field specifying the escaping
mode to use (\texttt{HTK} or \texttt{RAW}) can also be given, and the
\texttt{Fields} entry can be changed to include frequency counts in the
output (i.e.\ \texttt{Fields=ID,WFC}).  Alternatively, these settings
can be applied to the output using the \texttt{-h} and \texttt{-c}
command line options.

The allowable options to \htool{LGPrep} are as follows

\begin{optlist}
  \ttitem{-a n} Allow up to \texttt{n} new words in input texts
  (default 100000).

  \ttitem{-b n} Set the internal gram buffer size to \texttt{n} (default
  2000000). \htool{LGPrep} stores incoming $n$-grams in this buffer.
  When the buffer is full, the contents are sorted and written to an
  output gram file.  Thus, the buffer size determines the amount of
  process memory that \htool{LGPrep} will use and the size of the
  individual output gram files.

  \ttitem{-c} Add word counts to the output word map.  This overrides
       the setting in the input word map (default off).

  \ttitem{-d s} Store the output gram files in directory \texttt{s}
             (default current directory).
        
  \ttitem{-e n} Set the internal edited gram buffer size to \texttt{n}
  (default 100000).

  \ttitem{-f s} Fix (i.e.\ edit) the text source using the rules in
       \texttt{s}.

  \ttitem{-h} Do not use HTK escaping in the output word map (HTK
              escaping is on by default).

  \ttitem{-i n} Set the index of the first gram file output 
             to be \texttt{n} (default 0).

  \ttitem{-n n} Set the output $n$-gram size to \texttt{n} (default 3).

  \ttitem{-q} Tag words at sentence start with underscore (\_).

  \ttitem{-r s} Set the root name of the output gram files to
       \texttt{s} (default ``gram'').

  \ttitem{-s s} Write the string \texttt{s} into the source field of
       the output gram files.  This string should be a comment
       describing the text source.

  \ttitem{-w s} Write the output map file to \texttt{s} (default same
      as input map name stored in the output gram directory).

  \ttitem{-z} Suppress gram file output. This option allows
      \htool{LGPrep} to be used just to compute a word frequency map.
      It is also normally used when applying edit rules to the
      input.

  \stdoptQ
\end{optlist}
\stdopts{LGPrep}

\mysubsect{Tracing}{LGPrep-Tracing}

\htool{LGPrep} supports the following trace options, where each
trace flag is given in octal:
\begin{optlist}
\ttitem{00001}  Basic progress reporting. 
\ttitem{00002}  Monitor buffer save operations.
\ttitem{00004}  Trace word input stream.
\ttitem{00010}  Trace shift register input.
\ttitem{00020}  Rule input monitoring.
\ttitem{00040}  Print rule set.
\end{optlist}
Trace flags are set using the \texttt{-T} option or the \texttt{TRACE}
configuration variable.
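
Since the flag values are distinct octal bits, several trace options
can presumably be requested at once by combining them.  The short Python
sketch below shows the arithmetic only; the constant names are
hypothetical labels for the values listed above, and combining flags by
bitwise OR is an assumption based on their bit pattern.
\begin{verbatim}
# Hypothetical names for the octal trace flag values listed above.
BASIC_PROGRESS = 0o00001
BUFFER_SAVES   = 0o00002
WORD_INPUT     = 0o00004
SHIFT_REGISTER = 0o00010
RULE_INPUT     = 0o00020
PRINT_RULES    = 0o00040

# Combine progress reporting, buffer saves and shift register tracing.
flags = BASIC_PROGRESS | BUFFER_SAVES | SHIFT_REGISTER

# Render the combined value in octal, as it would be written after -T.
trace_arg = format(flags, "o")
print(trace_arg)   # prints 13
\end{verbatim}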
\index{lgprep@\htool{LGPrep}|)}
